Journal of Computer Applications


Text semantic de-duplication algorithm based on keyword graph representation

  

  • Received: 2022-10-09 Revised: 2022-11-29 Accepted: 2022-12-02 Online: 2023-04-12 Published: 2023-04-12


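WANG Jinyun, XIANG Yang

  1. College of Electronics and Information Engineering, Tongji University, Shanghai 201800
  • Corresponding author: XIANG Yang
  • Funding:
    Research on open-domain knowledge extraction, trustworthiness, and temporality in enterprise knowledge graph construction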

Abstract: The web contains a large number of redundant texts with the same or similar semantics. Text de-duplication prevents redundant texts from wasting storage space and reduces unnecessary cost in information extraction tasks. Traditional text de-duplication methods rely mostly on literal overlap; they make little use of semantic information and cannot capture interactions between sentences that are far apart in a long text, so their de-duplication performance is not ideal. To address text semantic de-duplication, a semantic de-duplication algorithm based on keyword graph representation is proposed. First, semantic keyword phrases are extracted from a text pair, and the pair is represented as a graph whose vertices are the keyword phrases. Second, the nodes are encoded in several ways, and a Graph Attention Network (GAT) learns the relationships between nodes to produce a vector representation of the text-pair graph, which is used to judge whether the two texts are semantically similar. Finally, similar texts are de-duplicated according to the semantic similarity of the text pairs. Compared with traditional methods, the proposed algorithm makes effective use of the semantic information in the text, and its graph structure links distant sentences in a long text through the co-occurrence of keyword phrases, increasing the semantic interaction between sentences. Experiments show that the proposed algorithm outperforms traditional algorithms such as Simhash, fine-tuned BERT (Bidirectional Encoder Representations from Transformers), and CIG (Concept Interaction Graph) on both public datasets, CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story). Its F1 score reaches 84.65% on CNSE and 90.76% on CNSS, indicating that it can effectively improve text de-duplication.
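
The first and last steps of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The keyword list is supplied by the caller rather than extracted, the sentence splitter is deliberately naive, and the node encoding and GAT are omitted entirely. The names `split_sentences`, `build_keyword_graph`, `deduplicate`, and `is_similar` are hypothetical, introduced only for illustration.

```python
import re
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter on English and Chinese end marks (an assumption)."""
    return [s.strip() for s in re.split(r"[.!?。!?]+", text) if s.strip()]


def build_keyword_graph(
    text_a: str, text_b: str, keywords: List[str]
) -> Tuple[Dict[str, Dict[str, List[str]]], Set[Tuple[str, str]]]:
    """Represent a text pair as a graph whose vertices are keywords.

    Each vertex stores the sentences from text "a" and text "b" that mention
    the keyword. Two keywords share an edge when they co-occur in at least
    one sentence, which is how distant sentences become connected.
    """
    nodes: Dict[str, Dict[str, List[str]]] = {
        k: {"a": [], "b": []} for k in keywords
    }
    edges: Set[Tuple[str, str]] = set()
    for side, text in (("a", text_a), ("b", text_b)):
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            present = [k for k in keywords if k.lower() in lowered]
            for k in present:
                nodes[k][side].append(sentence)
            # Store each undirected edge once, with endpoints in sorted order.
            for u, v in combinations(sorted(present), 2):
                edges.add((u, v))
    return nodes, edges


def deduplicate(
    texts: List[str], is_similar: Callable[[str, str], bool]
) -> List[str]:
    """Greedy de-duplication: keep a text only if no kept text is similar.

    `is_similar` is a placeholder for the pairwise judgment that a trained
    model would provide; any boolean function of two texts works here.
    """
    kept: List[str] = []
    for text in texts:
        if not any(is_similar(text, other) for other in kept):
            kept.append(text)
    return kept


# Example usage with hand-picked keywords (illustrative data only).
doc_a = "The flood hit the city. Rescue teams arrived. The city opened shelters."
doc_b = "Shelters opened in the city after the flood."
nodes, edges = build_keyword_graph(doc_a, doc_b, ["flood", "city", "shelters"])
```

In this example "flood" and "shelters" never share a sentence in `doc_a`, yet both are linked to "city", so the graph connects the first and third sentences through a shared vertex. The greedy loop in `deduplicate` is one simple choice; it makes a number of comparisons quadratic in the number of kept texts, so a large corpus would likely need a candidate-filtering step first.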

Key words: text semantic de-duplication, keyword extraction, text matching, graph representation, Graph Attention Network (GAT)

